A robust approach to model-based classification based on trimming and constraints
In a standard classification framework a set of trustworthy learning data are
employed to build a decision rule, with the final aim of classifying unlabelled
units belonging to the test set. Therefore, unreliable labelled observations,
namely outliers and data with incorrect labels, can strongly undermine the
classifier performance, especially if the training size is small. The present
work introduces a robust modification to the Model-Based Classification
framework, employing impartial trimming and constraints on the ratio between
the maximum and the minimum eigenvalue of the group scatter matrices. The
proposed method effectively handles the presence of noise in both the response and
the explanatory variables, providing reliable classification even when dealing with
contaminated datasets. A robust information criterion is proposed for model
selection. Experiments on real and simulated data, artificially adulterated,
are provided to highlight the benefits of the proposed method.
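For concreteness, the trimmed and constrained estimation problem can be written in the generic form used in the constrained-clustering literature; the symbols below (trimming level \(\alpha\), eigenvalue-ratio bound \(c\), group scatter matrices \(\Sigma_g\)) are assumed notation, not quoted from the paper.

```latex
% Trimmed, constrained likelihood (illustrative notation):
% z_i \in \{0,1\} flags the untrimmed units, \alpha is the trimming level,
% and c \ge 1 bounds the eigenvalue ratio of the group scatter matrices.
\max_{\theta,\, z}\ \sum_{i=1}^{n} z_i \log f(\mathbf{x}_i, \ell_i \mid \theta)
\quad \text{s.t.} \quad
\sum_{i=1}^{n} z_i = \lceil n(1-\alpha) \rceil,
\qquad
\frac{\max_{g,j} \lambda_j(\Sigma_g)}{\min_{g,j} \lambda_j(\Sigma_g)} \le c
```

Setting c = 1 forces spherical, equal-volume components, while letting c grow recovers the unconstrained model; the eigenvalue-ratio bound also keeps the trimmed likelihood from degenerating.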
Modelling heterogeneity in Latent Space Models for Multidimensional Networks
Multidimensional network data can have different levels of complexity, as
nodes may be characterized by heterogeneous individual-specific features, which
may vary across the networks. This paper introduces a class of models for
multidimensional network data, where different levels of heterogeneity within
and between networks can be considered. The proposed framework is developed in
the family of latent space models, and it aims to distinguish symmetric
relations between the nodes and node-specific features. Model parameters are
estimated via a Markov Chain Monte Carlo algorithm. Simulated data and an
application to real-world fruit import/export data are used to
illustrate and discuss the performance of the proposed models.
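For orientation, a generic latent space formulation shows how network-specific intercepts and node-specific effects can enter the edge probabilities; the notation below is illustrative, and the paper's exact parameterisation may differ.

```latex
% Hypothetical latent space form for edge y_{ij}^{(k)} in network k:
% \alpha^{(k)} is a network-specific intercept, the distance term captures
% symmetric relations, and \theta_i^{(k)} carries node-specific heterogeneity.
\operatorname{logit} \Pr\bigl(y_{ij}^{(k)} = 1\bigr)
  = \alpha^{(k)} - \bigl\lVert \mathbf{z}_i - \mathbf{z}_j \bigr\rVert
  + \theta_i^{(k)} + \theta_j^{(k)}
```

Here the positions z_i are shared across networks, so the distance term models the symmetric relations between nodes, while the theta terms absorb individual-specific features that may vary from one network to another.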
Bayesian nonparametric Plackett-Luce models for the analysis of preferences for college degree programmes
In this paper we propose a Bayesian nonparametric model for clustering
partial ranking data. We start by developing a Bayesian nonparametric extension
of the popular Plackett-Luce choice model that can handle an infinite number of
choice items. Our framework is based on the theory of random atomic measures,
with the prior specified by a completely random measure. We characterise the
posterior distribution given data, and derive a simple and effective Gibbs
sampler for posterior simulation. We then develop a Dirichlet process mixture
extension of our model and apply it to investigate the clustering of
preferences for college degree programmes amongst Irish secondary school
graduates. The existence of clusters of applicants who have similar preferences
for degree programmes is established and we determine that subject matter and
geographical location of the third level institution characterise these
clusters.
Comment: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org) at http://dx.doi.org/10.1214/14-AOAS717.
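The extension builds on the standard Plackett-Luce model, in which each choice item k has a positive weight w_k and a ranking is generated by successive choices without replacement; for a complete ranking of K items it reads:

```latex
% Standard (finite) Plackett-Luce probability of a complete ranking \rho:
\Pr(\rho \mid w) = \prod_{t=1}^{K} \frac{w_{\rho_t}}{\sum_{s=t}^{K} w_{\rho_s}}
```

In the Bayesian nonparametric extension, the weights become the atoms of a completely random measure, which is what allows the set of choice items to be infinite.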
Variational Bayesian Inference for the Latent Position Cluster Model
Many recent approaches to modeling social networks have focused on embedding the actors in a latent "social space". Links are more likely for actors that are close in social space than for actors that are distant in social space. In particular, the Latent Position Cluster Model (LPCM) [1] allows for explicit modelling of the clustering that is exhibited in many network datasets. However, inference for the LPCM via MCMC is cumbersome, and scaling this model to large or even medium-sized networks with many interacting nodes is a challenge. Variational Bayesian methods offer one solution to this problem. An approximate, closed-form posterior is formed, with unknown variational parameters. These parameters are tuned to minimize the Kullback-Leibler divergence between the approximate variational posterior and the true posterior, which is known only up to proportionality. The variational Bayesian approach is shown to give a computationally efficient way of fitting the LPCM. The approach is demonstrated on a number of data sets and is shown to give a good fit.
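The variational calibration rests on a standard identity: since log p(y) does not depend on q, minimizing the Kullback-Leibler divergence from the variational posterior q to the true posterior is equivalent to maximizing the evidence lower bound (ELBO), which requires the posterior only up to proportionality:

```latex
% KL divergence to the true posterior, rewritten to expose the ELBO:
\operatorname{KL}\bigl(q \,\|\, p(\theta \mid y)\bigr)
  = \log p(y)
  - \underbrace{\mathbb{E}_{q}\bigl[\log p(\theta, y) - \log q(\theta)\bigr]}_{\text{ELBO}}
```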
A mixture of experts model for rank data with applications in election studies
A voting bloc is defined to be a group of voters who have similar voting
preferences. The cleavage of the Irish electorate into voting blocs is of
interest. Irish elections employ a "single transferable vote" electoral
system; under this system voters rank some or all of the electoral candidates
in order of preference. These rank votes provide a rich source of preference
information from which inferences about the composition of the electorate may
be drawn. Additionally, the influence of social factors or covariates on the
electorate composition is of interest. A mixture of experts model is a mixture
model in which the model parameters are functions of covariates. A mixture of
experts model for rank data is developed to provide a model-based method to
cluster Irish voters into voting blocs, to examine the influence of social
factors on this clustering and to examine the characteristic preferences of the
voting blocs. The Benter model for rank data is employed as the family of
component densities within the mixture of experts model; generalized linear
model theory is employed to model the influence of covariates on the mixing
proportions. Model fitting is achieved via a hybrid of the EM and MM
algorithms. The methodology is illustrated by examining an Irish
presidential election. The existence of voting blocs in the electorate is
established and it is determined that age and government satisfaction levels
are important factors in influencing voting in this election.
Comment: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org) at http://dx.doi.org/10.1214/08-AOAS178.
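In a mixture of experts model of this kind, the GLM link for the mixing proportions is typically a multinomial logit; a sketch in assumed notation, with Benter component densities f_g for voter i's ranking r_i and covariates x_i:

```latex
% Covariate-dependent mixing proportions (multinomial logit) and mixture density:
\pi_g(\mathbf{x}_i) = \frac{\exp(\mathbf{x}_i^{\top}\boldsymbol{\gamma}_g)}
                           {\sum_{h=1}^{G} \exp(\mathbf{x}_i^{\top}\boldsymbol{\gamma}_h)},
\qquad
f(r_i \mid \mathbf{x}_i) = \sum_{g=1}^{G} \pi_g(\mathbf{x}_i)\, f_g(r_i \mid \theta_g)
```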
Robust variable selection for model-based learning in presence of adulteration
The problem of identifying the most discriminating features when performing
supervised learning has been extensively investigated. In particular, several
methods for variable selection in model-based classification have been
proposed. Surprisingly, the impact of outliers and wrongly labeled units on the
determination of relevant predictors has received far less attention, with
almost no dedicated methodologies available in the literature. In the present
paper, we introduce two robust variable selection approaches: one that embeds a
robust classifier within a greedy-forward selection procedure and the other
based on the theory of maximum likelihood estimation and irrelevance. The
former recasts the feature identification as a model selection problem, while
the latter regards the relevant subset as a model parameter to be estimated.
The benefits of the proposed methods, in contrast with non-robust solutions,
are assessed via an experiment on synthetic data. An application to a
high-dimensional classification problem of contaminated spectroscopic data
concludes the paper.
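To make the first approach concrete, here is a minimal sketch of a greedy-forward wrapper around a robust score; the function name and the `score` callable (e.g. a robust information criterion obtained by refitting a robust classifier on each candidate subset) are assumptions for illustration, not the paper's API.

```python
from typing import Callable, List, Set

def greedy_forward_selection(
    candidates: Set[str],
    score: Callable[[List[str]], float],
    max_vars: int = 10,
) -> List[str]:
    """Greedy-forward variable selection against a robust score.

    `score` is a placeholder: it should refit a robust classifier on the
    given variable subset and return a model-selection criterion where
    higher is better (an assumption made for this sketch).
    """
    selected: List[str] = []
    best = score(selected)  # baseline score of the empty model
    while candidates and len(selected) < max_vars:
        # Score every remaining variable appended to the current subset.
        gains = {v: score(selected + [v]) for v in candidates}
        v_star = max(gains, key=gains.get)
        if gains[v_star] <= best:
            break  # no candidate improves the criterion: stop early
        selected.append(v_star)
        candidates.remove(v_star)
        best = gains[v_star]
    return selected
```

The second approach in the abstract, which treats the relevant subset as a model parameter to be estimated, does not fit this wrapper form and is not sketched here.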
- …